street segment


MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark

Ou, Yiwei, Ren, Xiaobin, Sun, Ronggui, Gao, Guansong, Jiang, Ziyi, Zhao, Kaiqi, Manfredini, Manfredo

arXiv.org Artificial Intelligence

Existing visual place recognition (VPR) datasets predominantly rely on vehicle-mounted imagery, lack multimodal diversity, and underrepresent dense, mixed-use street-level spaces, especially in non-Western urban contexts. To address these gaps, we introduce MMS-VPR, a large-scale multimodal dataset for street-level place recognition in complex, pedestrian-only environments. The dataset comprises 78,575 annotated images and 2,512 video clips captured across 207 locations in a ~70,800 $\mathrm{m}^2$ open-air commercial district in Chengdu, China. Each image is labeled with precise GPS coordinates, a timestamp, and textual metadata, and the collection covers varied lighting conditions, viewpoints, and timeframes. MMS-VPR follows a systematic and replicable data collection protocol with minimal device requirements, lowering the barrier for scalable dataset creation. Importantly, the dataset forms an inherent spatial graph with 125 edges, 81 nodes, and 1 subgraph, enabling structure-aware place recognition. We further define two application-specific subsets -- Dataset_Edges and Dataset_Points -- to support fine-grained and graph-based evaluation tasks. Extensive benchmarks using conventional VPR models, graph neural networks, and multimodal baselines show substantial improvements when leveraging multimodal and structural cues. MMS-VPR facilitates future research at the intersection of computer vision, geospatial understanding, and multimodal reasoning. The dataset is publicly available at https://huggingface.co/datasets/Yiwei-Ou/MMS-VPR.
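The spatial graph described above (locations as nodes, street segments as edges) can be pictured with a minimal sketch. The class and field names below are illustrative, not the dataset's actual schema; they only show how edge-level place labels and node adjacency could back a structure-aware recognizer.

```python
from collections import defaultdict

# Hypothetical sketch of a spatial place graph: nodes are intersections,
# edges are street segments, and each segment carries a place label.
class SpatialPlaceGraph:
    def __init__(self):
        self.adjacency = defaultdict(set)   # node -> neighboring nodes
        self.edge_places = {}               # segment -> place label

    def add_segment(self, u, v, place_label):
        self.adjacency[u].add(v)
        self.adjacency[v].add(u)
        self.edge_places[frozenset((u, v))] = place_label

    def neighbors(self, node):
        # Structural context usable as an extra cue at recognition time.
        return sorted(self.adjacency[node])

g = SpatialPlaceGraph()
g.add_segment("n1", "n2", "tea_house_row")
g.add_segment("n2", "n3", "food_court")
neighbors_of_n2 = g.neighbors("n2")
```

A graph neural network baseline would consume exactly this adjacency structure alongside the per-image features.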


Grid-Based Projection of Spatial Data into Knowledge Graphs

Anjomshoaa, Amin, Schuster, Hannah, Polleres, Axel

arXiv.org Artificial Intelligence

Spatial knowledge graphs (SKGs) are experiencing growing adoption as a means to model real-world entities, proving especially invaluable in domains like crisis management and urban planning. Because RDF specifications offer limited support for effectively managing spatial information, it is common practice to include text-based serializations of geometrical features, such as polygons and lines, as string literals in knowledge graphs. Consequently, SKGs often rely on geo-enabled RDF stores capable of parsing, interpreting, and indexing such serializations. In this paper, we leverage grid cells as the foundational element of SKGs and demonstrate how efficiently the spatial characteristics of real-world entities and their attributes can be encoded within knowledge graphs. Furthermore, we introduce a novel methodology for representing street networks in knowledge graphs, diverging from the conventional practice of individually capturing each street segment. Instead, our approach tessellates the street network using grid cells and creates a simplified representation that can be utilized for various routing and navigation tasks, relying solely on RDF specifications.


Bayesian Simultaneous Localization and Multi-Lane Tracking Using Onboard Sensors and a SD Map

Xia, Yuxuan, Stenborg, Erik, Fu, Junsheng, Hendeby, Gustaf

arXiv.org Artificial Intelligence

High-definition maps with accurate lane-level information are crucial for autonomous driving, but the creation of these maps is a resource-intensive process. To this end, we present a cost-effective solution to create lane-level roadmaps using only the global navigation satellite system (GNSS) and a camera on customer vehicles. Our proposed solution utilizes a prior standard-definition (SD) map, GNSS measurements, visual odometry, and lane marking edge detection points to simultaneously estimate the vehicle's 6D pose, its position within the SD map, and the 3D geometry of traffic lines. This is achieved using a Bayesian simultaneous localization and multi-object tracking filter, where the estimation of traffic lines is formulated as a multiple extended object tracking problem, solved using a trajectory Poisson multi-Bernoulli mixture (TPMBM) filter. In TPMBM filtering, traffic lines are modeled using B-spline trajectories, and each trajectory is parameterized by a sequence of control points. The proposed solution has been evaluated using experimental data collected by a test vehicle driving on a highway. Preliminary results show that the traffic line estimates, overlaid on the satellite image, generally align with the lane markings up to some lateral offsets.
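The B-spline parameterization mentioned above reduces each lane line to a short sequence of control points. As a minimal sketch (the TPMBM filter estimates the control points; here they are simply fixed), a uniform cubic B-spline segment can be evaluated with the standard basis polynomials:

```python
# Illustrative sketch: a lane line modeled as a uniform cubic B-spline.
# Control points are scalars here; use a tuple per axis for 2D/3D geometry.
def cubic_bspline_point(u, p):
    """Evaluate a uniform cubic B-spline segment at u in [0, 1],
    given four consecutive control points p[0..3]."""
    b0 = (1 - u) ** 3
    b1 = 3 * u**3 - 6 * u**2 + 4
    b2 = -3 * u**3 + 3 * u**2 + 3 * u + 1
    b3 = u**3
    # The four basis weights sum to 6, so dividing by 6 keeps the curve
    # inside the convex hull of the control points.
    return (b0 * p[0] + b1 * p[1] + b2 * p[2] + b3 * p[3]) / 6.0

# Evenly spaced collinear control points reproduce a straight lane line:
mid = cubic_bspline_point(0.0, [0.0, 1.0, 2.0, 3.0])   # = 1.0
end = cubic_bspline_point(1.0, [0.0, 1.0, 2.0, 3.0])   # = 2.0
```

The compactness of this parameterization is what makes per-trajectory state estimation tractable in the filter.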


A comparison of machine learning surrogate models of street-scale flooding in Norfolk, Virginia

McSpadden, Diana, Goldenberg, Steven, Roy, Binata, Schram, Malachi, Goodall, Jonathan L., Richter, Heather

arXiv.org Artificial Intelligence

Low-lying coastal cities, exemplified by Norfolk, Virginia, face the challenge of street flooding caused by rainfall and tides, which strain transportation and sewer systems and can lead to property damage. While high-fidelity, physics-based simulations provide accurate predictions of urban pluvial flooding, their computational complexity renders them unsuitable for real-time applications. Using data from Norfolk rainfall events between 2016 and 2018, this study compares the performance of a previous surrogate model based on a random forest algorithm with two deep learning models: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). This investigation underscores the importance of using a model architecture that supports the communication of prediction uncertainty and the effective integration of relevant, multi-modal features.
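The recurrent surrogates compared here process a rainfall/tide time series step by step. A single-unit GRU update, written out in plain Python, shows the gating mechanism involved; the weights below are placeholders, whereas a real surrogate learns them from the high-fidelity simulation outputs.

```python
import math

# Minimal single-unit GRU cell (illustrative weights, not a trained model).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, w):
    z = sigmoid(w["wz"] * x + w["uz"] * h)                 # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h)                 # reset gate
    h_tilde = math.tanh(w["wh"] * x + w["uh"] * (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde                       # blended new state

w = {"wz": 0.5, "uz": 0.5, "wr": 0.5, "ur": 0.5, "wh": 1.0, "uh": 1.0}
h = 0.0
for rainfall in [0.0, 2.0, 5.0, 1.0]:   # toy rainfall sequence
    h = gru_step(h, rainfall, w)
```

An LSTM adds a separate cell state and an output gate; the GRU's fewer parameters are one reason such comparisons are worth running on limited event data.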


An agent-based approach to procedural city generation incorporating Land Use and Transport Interaction models

Santos, Luiz Fernando Silva Eugênio dos, Aranha, Claus, de Carvalho, André Ponce de Leon F

arXiv.org Artificial Intelligence

We apply the knowledge of urban settings established by the study of Land Use and Transport Interaction (LUTI) models to develop reward functions for an agent-based system capable of planning realistic artificial cities. The system aims to replicate at the micro scale the main components of real settlements, such as zoning and accessibility in a road network. Moreover, we propose a novel representation for the agent's environment that efficiently combines the road graph with a discrete model for the land. Our system starts from an empty map consisting only of the road network graph, and the agent incrementally expands it by building new sites while distinguishing land uses between residential, commercial, industrial, and recreational.
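A LUTI-inspired reward for such an agent can be sketched as a weighted sum of zoning compatibility and road accessibility. The compatibility table, weights, and functional form below are illustrative assumptions, not the paper's actual reward.

```python
# Hedged sketch of a reward for placing a site of a given land use:
# reward rises with compatible neighboring uses and with road proximity.
def reward(site_use, neighbor_uses, dist_to_road, weights=(1.0, 0.5)):
    compatible = {
        "residential": {"residential", "commercial", "recreational"},
        "commercial": {"residential", "commercial"},
        "industrial": {"industrial"},
        "recreational": {"residential", "recreational"},
    }
    zoning = sum(1 for u in neighbor_uses if u in compatible[site_use])
    accessibility = 1.0 / (1.0 + dist_to_road)   # closer roads score higher
    return weights[0] * zoning + weights[1] * accessibility

# A residential site next to shops and a park, one cell from a road:
r = reward("residential", ["commercial", "recreational"], dist_to_road=1.0)
```

The agent would greedily (or via reinforcement learning) pick the site and use maximizing such a reward at each expansion step.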


Generating Landmark Navigation Instructions from Maps as a Graph-to-Text Problem

Schumann, Raphael, Riezler, Stefan

arXiv.org Artificial Intelligence

Car-focused navigation services are based on turns and distances of named streets, whereas navigation instructions naturally used by humans are centered around physical objects called landmarks. We present a neural model that takes OpenStreetMap representations as input and learns, from human natural language instructions, to generate navigation instructions that contain visible and salient landmarks. Routes on the map are encoded in a location- and rotation-invariant graph representation that is decoded into natural language instructions. Our work is based on a novel dataset of 7,672 crowd-sourced instances that have been verified by human navigation in Street View. Our evaluation shows that the navigation instructions generated by our system have similar properties to human-generated instructions, and lead to successful human navigation in Street View. Current navigation services provided by the automotive industry or by Google Maps generate route instructions based on turns and distances of named streets. In contrast, humans naturally use an efficient mode of navigation based on visible and salient physical objects called landmarks. Route instructions based on landmarks are useful if GPS tracking is poor or not available, and if information is inexact regarding distances (e.g., in human estimates) or street names (e.g., for users riding a bicycle or on a bus). In our framework, routes on the map are learned by discretizing the street layout, connecting street segments with adjacent points of interest - thus encoding visibility of landmarks - and encoding the route and surrounding landmarks in a location- and rotation-invariant graph representation. Based on crowd-sourced natural language instructions for such map representations, a graph-to-text mapping is learned that decodes graph representations into natural language route instructions that contain salient landmarks.
Our work is accompanied by a dataset of 7,672 instances of routes rendered on OpenStreetMap and crowd-sourced natural language instructions. The navigation instructions were generated by workers on the basis of maps including all points of interest, but no street names. Furthermore, the time-normalized success rate of human workers finding the correct goal location in Street View is 66%. Since these routes can have a partial overlap with routes in the training set, we further performed an evaluation on completely unseen routes. The rate of produced landmarks drops slightly compared to human references, and the time-normalized success rate also drops slightly, to 63%. While there is still room for improvement, our results showcase a promising direction of research, with a wide potential of applications in various existing map applications and navigation systems. Mirowski et al. (2018) published a subset of Street View covering parts of New York City and Pittsburgh.
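The rotation-invariance idea behind the route encoding can be sketched simply: instead of absolute compass bearings, each step is described by its turn angle relative to the previous segment, so rotating the whole route on the map leaves the representation unchanged. The function below is an illustrative toy, not the paper's encoder.

```python
import math

# Encode a polyline route by relative turn angles (rotation-invariant).
def turn_angles(points):
    headings = [math.atan2(y2 - y1, x2 - x1)
                for (x1, y1), (x2, y2) in zip(points, points[1:])]
    # Wrap each heading difference into (-pi, pi].
    return [(b - a + math.pi) % (2 * math.pi) - math.pi
            for a, b in zip(headings, headings[1:])]

route = [(0, 0), (1, 0), (1, 1)]        # east, then north: one left turn
rotated = [(0, 0), (0, 1), (-1, 1)]     # the same route rotated 90 degrees
angles, angles_rot = turn_angles(route), turn_angles(rotated)
```

Both routes yield the same single turn angle, which is exactly the property that lets one model generalize across route orientations.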